CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there has been no rigorous study of how
exactly cleaning affects ML: the ML community usually focuses on developing ML
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone, without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both algorithms
commonly used in practice and state-of-the-art solutions in the academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers. Comment: published in ICDE 202
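The Benjamini-Yekutieli step mentioned in the abstract can be illustrated with a minimal sketch. This is not the CleanML code: the function name `benjamini_yekutieli`, the default `alpha`, and the sample p-values are assumptions made for illustration. The sketch applies the standard BY step-up rule, which rejects the k smallest p-values, where k is the largest rank with p_(k) <= k * alpha / (m * c(m)) and c(m) = 1 + 1/2 + ... + 1/m.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Hypothetical helper for illustration; applies the BY step-up rule,
    which controls the false discovery rate under arbitrary dependence.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Harmonic correction factor c(m) = sum_{i=1..m} 1/i.
    c_m = sum(1.0 / i for i in range(1, m + 1))

    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is under its threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Illustrative p-values (made up, not taken from the study).
decisions = benjamini_yekutieli([0.001, 0.04, 0.03, 0.5], alpha=0.05)
```

Because of the extra factor c(m), BY is more conservative than the Benjamini-Hochberg procedure; in exchange, it does not require the tests to be independent.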
Longitudinal trends in prostate cancer incidence, mortality, and survival of patients from two Shanghai city districts: a retrospective population-based cohort study, 2000-2009.
Background: Prostate cancer is the fifth most common cancer affecting men of all ages in China, but robust surveillance data on its occurrence and outcome are lacking. The specific objective of this retrospective study was to analyze the longitudinal trends of prostate cancer incidence, mortality, and survival in Shanghai from 2000 to 2009. Methods: A retrospective population-based cohort study was performed using data from a central district (Putuo) and a suburban district (Jiading) of Shanghai. Records of all prostate cancer cases reported to the Shanghai Cancer Registry from 2000 to 2009 for the two districts were reviewed. Prostate cancer outcomes were ascertained by matching cases with individual mortality data (up to 2010) from the National Death Register. The Cox proportional hazards model was used to analyze factors associated with prostate cancer survival. Results: A total of 1022 prostate cancer cases were diagnosed from 2000 to 2009. The average age of patients was 75 years. A rapid increase in incidence occurred during the study period. Compared with the year 2000, 2009 incidence was 3.28 times higher in Putuo and 5.33 times higher in Jiading. Prostate cancer mortality declined from 4.45 per 100,000 individuals per year in 2000 to 1.94 per 100,000 in 2009 in Putuo and from 5.45 per 100,000 to 3.5 per 100,000 in Jiading during the same period. One-year and 5-year prostate cancer survival rates were 95% and 56% in Putuo, and 88% and 51% in Jiading, respectively. Staging of disease, Karnofsky Performance Scale Index, and selection of chemotherapy were three independent factors influencing the survival of prostate cancer patients. Conclusions: The prostate cancer incidence increased rapidly from 2000 to 2009, and prostate cancer survival rates decreased in urban and suburban Chinese populations. Early detection and prompt prostate cancer treatment are important for improving health and for increasing survival rates of the Shanghai male population.
A Fully Polynomial Time Approximation Scheme for the Replenishment Storage Problem
The Replenishment Storage problem (RSP) is to minimize the storage capacity
requirement for a deterministic demand, multi-item inventory system where each
item has a given reorder size and cycle length. The reorders can only take
place at integer time units within the cycle. This problem was shown to be
weakly NP-hard for constant joint cycle length (the least common multiple of
the lengths of all individual cycles). When all items have the same constant
cycle length, there exists a Fully Polynomial Time Approximation Scheme
(FPTAS), but no FPTAS was previously known for the case when the individual
cycles are different. Here we devise the first known FPTAS for the RSP with
different individual cycles and constant joint cycle length.
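To make the objective concrete, here is a small brute-force sketch of the problem. It is not the FPTAS from the paper and takes exponential time. It assumes each item's stock is depleted linearly from its reorder size to zero over one cycle, and that the only decision is each item's integer reorder offset within its cycle. The names `peak_storage` and `best_offsets` and the sample instance are hypothetical.

```python
from itertools import product
from math import lcm  # multi-argument lcm requires Python 3.9+


def peak_storage(sizes, cycles, offsets):
    """Peak total inventory over one joint cycle for the given offsets.

    Assumes linear depletion, so total inventory only jumps upward at
    integer reorder times; checking integer times therefore finds the peak.
    """
    horizon = lcm(*cycles)  # joint cycle length
    peak = 0.0
    for t in range(horizon):
        total = 0.0
        for size, cycle, offset in zip(sizes, cycles, offsets):
            elapsed = (t - offset) % cycle  # time since the last reorder
            total += size * (1 - elapsed / cycle)
        peak = max(peak, total)
    return peak


def best_offsets(sizes, cycles):
    """Exhaustively try every offset combination; return (peak, offsets)."""
    best = None
    for offsets in product(*(range(c) for c in cycles)):
        peak = peak_storage(sizes, cycles, offsets)
        if best is None or peak < best[0]:
            best = (peak, offsets)
    return best


# Illustrative instance: two items, reorder size 4, cycle length 2.
best = best_offsets([4, 4], [2, 2])
```

In this toy instance, reordering both items at the same time needs capacity 8, while staggering them by one time unit needs only 6. That gap is the quantity the RSP seeks to minimize.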